Abstract

An important form of prior information in clustering comes in the form of cannot-link and must-link constraints. We present a generalization of the popular spectral clustering technique which integrates such constraints. Motivated by the recently proposed 1-spectral clustering for the unconstrained problem, our method is based on a tight relaxation of the constrained normalized cut into a continuous optimization problem. In contrast to all other methods which have been suggested for constrained spectral clustering, we can always guarantee to satisfy all constraints. Moreover, our soft formulation allows us to optimize a trade-off between normalized cut and the number of violated constraints. An efficient implementation is provided which scales to large datasets. We consistently outperform all other proposed methods in the experiments.


1 Introduction 

The task of clustering is to find a natural grouping of items given, e.g., pairwise similarities. In real-world problems, such a natural grouping is often hard to discover from the given similarities alone, or there is more than one way to group the given items. In either case, clustering methods benefit from domain knowledge that biases the result towards the desired clustering. Wagstaff et al. were the first to consider constrained clustering by encoding available domain knowledge in the form of pairwise must-link (ML, for short) and cannot-link (CL) constraints. By incorporating these constraints into k-means they achieve much better performance. Since acquiring such constraint information is relatively easy, constrained clustering has become an active area of research; see [1] for an overview.

Spectral clustering is a graph-based clustering algorithm originally derived as a relaxation of the NP-hard normalized cut problem. The spectral relaxation leads to an eigenproblem for the graph Laplacian. However, it is known that the spectral relaxation can be quite loose [6]. More recently, it has been shown that one can equivalently rewrite the discrete (combinatorial) normalized Cheeger cut problem as a continuous optimization problem using the nonlinear 1-graph Laplacian, which yields much better cuts than the spectral relaxation. In further work it was shown that all balanced graph cut problems, including normalized cut, have a tight relaxation into a continuous optimization problem [9].

The first approach to integrate constraints into spectral clustering was based on the idea of modifying the weight matrix in order to enforce the must-link and cannot-link constraints and then solving the resulting unconstrained problem. Another idea is to adapt the embedding obtained from the first k eigenvectors of the graph Laplacian [12]. Closer to the original normalized graph cut problem are the approaches that start with the optimization problem of the spectral relaxation and add constraints that encode must-links and cannot-links. Furthermore, the case where the constraints are allowed to be inconsistent is considered in [4].

In this paper we contribute in various ways to the area of graph-based constrained learning. First, we show in the spirit of 1-spectral clustering that the constrained normalized cut problem has a tight relaxation as an unconstrained continuous optimization problem. Our method, which we call COSC, is the first one in the field of constrained spectral clustering which can guarantee that all given constraints are fulfilled. While we present arguments that in practice it is the best choice to satisfy all constraints even if the data is noisy, in the case of inconsistent or unreliable constraints one should refrain from doing so. Thus our second contribution is to show that our framework can be extended to handle degree-of-belief and even inconsistent constraints. In this case COSC optimizes a trade-off between having a small normalized cut and a small number of violated constraints. We present an efficient implementation of COSC based on an optimization technique proposed in [9] which scales to large datasets. While the continuous optimization problem is non-convex and thus convergence to the global optimum is not guaranteed, we can show that our method either improves any given partition which satisfies all constraints or stops after one iteration.


All omitted proofs and additional experimental results can be found in the supplementary material.

Notation. Set functions are denoted by a hat, $\hat{S}$, while the corresponding extension is $S$. In this paper, we consider the normalized cut problem with general vertex weights. Formally, let $G(V, E, w, b)$ be an undirected graph with vertex set $V$ and edge set $E$, together with edge weights $w : V \times V \to \mathbb{R}_+$ and vertex weights $b : V \to \mathbb{R}_+$, and let $n = |V|$. Let $C \subset V$ and denote by $\overline{C} = V \setminus C$ its complement. We define respectively the cut, the generalized volume and the normalized cut (with general vertex weights) of a partition $(C, \overline{C})$ as

$$\mathrm{cut}(C, \overline{C}) := 2 \sum_{i \in C,\, j \in \overline{C}} w_{ij}, \qquad \mathrm{gvol}(C) := \sum_{i \in C} b_i,$$

$$\mathrm{bal}(C) := 2\, \frac{\mathrm{gvol}(C)\, \mathrm{gvol}(\overline{C})}{\mathrm{gvol}(V)}, \qquad \mathrm{NCut}(C, \overline{C}) := \frac{\mathrm{cut}(C, \overline{C})}{\mathrm{bal}(C)}.$$

We obtain ratio cut and normalized cut for special cases of the vertex weights, $b_i = 1$ and $b_i = d_i$, where $d_i = \sum_{j=1}^n w_{ij}$, respectively. In the ratio cut case, $\mathrm{gvol}(C)$ is the cardinality of $C$, and in the normalized cut case it is the volume of $C$, denoted by $\mathrm{vol}(C)$.


2 The Constrained Normalized Cut Problem

We consider the normalized cut problem with must-link and cannot-link constraints. Let $G(V, E, w, b)$ denote the given graph and $Q^m$, $Q^c$ be the constraint matrices, where the element $q^m_{ij}$ (or $q^c_{ij}$) $\in \{0, 1\}$ specifies the must-link (or cannot-link) constraint between $i$ and $j$. In the following we always assume that $G$ is connected. Everything stated below, as well as our suggested algorithm, can easily be generalized to degree-of-belief constraints by allowing $q^m_{ij}$ (and $q^c_{ij}$) $\in [0, 1]$. However, in the following we consider only $q^m_{ij}$ (and $q^c_{ij}$) $\in \{0, 1\}$, in order to keep the theoretical statements more accessible.

Definition 2.1. We call a partition $(C, \overline{C})$ consistent if it satisfies all constraints in $Q^m$ and $Q^c$.

Then the constrained normalized cut problem is to minimize $\mathrm{NCut}(C, \overline{C})$ over all consistent partitions. If the constraints are unreliable or inconsistent, one can relax this problem and optimize a trade-off between normalized cut and the number of violated constraints. In this paper, we address both problems in a common framework.

We define the set functions $\hat{M}, \hat{N} : 2^V \to \mathbb{R}$ as

$$\hat{M}(C) := 2 \sum_{i \in C,\, j \in \overline{C}} q^m_{ij},$$

$$\hat{N}(C) := \sum_{i \in C,\, j \in C} q^c_{ij} + \sum_{i \in \overline{C},\, j \in \overline{C}} q^c_{ij} = \mathrm{vol}(Q^c) - 2 \sum_{i \in C,\, j \in \overline{C}} q^c_{ij},$$

where $\mathrm{vol}(Q^c) = \sum_{i,j=1}^n q^c_{ij}$. $\hat{M}(C)$ and $\hat{N}(C)$ are equal to twice the number of violated must-link and cannot-link constraints of the partition $(C, \overline{C})$, respectively.

As we show in the following, both the constrained normalized cut problem and its soft version can be addressed by minimization of $\hat{F}_\gamma : 2^V \to \mathbb{R}$ defined as

$$\hat{F}_\gamma(C) := \frac{\mathrm{cut}(C, \overline{C}) + \gamma \big( \hat{M}(C) + \hat{N}(C) \big)}{\mathrm{bal}(C)}, \qquad (1)$$

where $\gamma \in \mathbb{R}_+$. Note that $\hat{F}_\gamma(C) = \mathrm{NCut}(C, \overline{C})$ if $(C, \overline{C})$ is consistent. Thus the minimization of $\hat{F}_\gamma(C)$ corresponds to a trade-off between having a small normalized cut and satisfying all constraints, parameterized by $\gamma$.

The relation between the parameter $\gamma$ and the number of constraints violated by the partition minimizing $\hat{F}_\gamma$ is quantified in the following lemma.

Lemma 2.1. Let $(C, \overline{C})$ be consistent and $\lambda = \mathrm{NCut}(C, \overline{C})$. If $\gamma \ge \frac{\mathrm{gvol}(V)}{4(l+1)} \lambda$, then any minimizer of $\hat{F}_\gamma$ violates no more than $l$ constraints.

Proof. Any partition $(C, \overline{C})$ that violates more than $l$ constraints satisfies $\hat{M}(C) + \hat{N}(C) \ge 2(l+1)$ and thus

$$\hat{F}_\gamma(C) = \mathrm{NCut}(C, \overline{C}) + \gamma\, \frac{\hat{M}(C) + \hat{N}(C)}{\mathrm{bal}(C)} \ge \mathrm{NCut}(C, \overline{C}) + \frac{4 \gamma (l+1)}{\mathrm{gvol}(V)} > \frac{4 \gamma (l+1)}{\mathrm{gvol}(V)},$$

where we have used that $\max_C \mathrm{bal}(C) = \mathrm{gvol}(V)/2$ and $\mathrm{NCut}(C, \overline{C}) > 0$ as the graph is connected. Assume now that the partition $(D, \overline{D})$ minimizes $\hat{F}_\gamma$ for $\gamma \ge \frac{\mathrm{gvol}(V)}{4(l+1)} \lambda$ and violates more than $l$ constraints. Then

$$\hat{F}_\gamma(D) > \frac{4 \gamma (l+1)}{\mathrm{gvol}(V)} \ge \lambda = \hat{F}_\gamma(C),$$

which leads to a contradiction.
□

Note that it is easy to construct a partition which is consistent, and thus the above choice of $\gamma$ is constructive. The following theorem is immediate from the above lemma for the special case $l = 0$.

Theorem 2.1. Let $(C, \overline{C})$ be consistent with the given constraints and $\lambda = \mathrm{NCut}(C, \overline{C})$. Then for $\gamma \ge \frac{\mathrm{gvol}(V)}{4} \lambda$ it holds that

$$\operatorname*{arg\,min}_{C \subset V:\ (C, \overline{C}) \text{ consistent}} \mathrm{NCut}(C, \overline{C}) = \operatorname*{arg\,min}_{C \subset V} \hat{F}_\gamma(C)$$

and the optimum values of both problems are equal.


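Proof of Lemma 2.2. We have

$$\sum_{i,j=1}^n w_{ij}\, |(1_C)_i - (1_C)_j| = \mathrm{cut}(C, \overline{C})$$

and

$$\Big\| B \Big( 1_C - \frac{1}{\mathrm{gvol}(V)} \langle 1_C, b \rangle 1 \Big) \Big\|_1 = \mathrm{gvol}(C) \Big( 1 - \frac{\mathrm{gvol}(C)}{\mathrm{gvol}(V)} \Big) + \mathrm{gvol}(\overline{C})\, \frac{\mathrm{gvol}(C)}{\mathrm{gvol}(V)} = 2\, \frac{\mathrm{gvol}(C)\, \mathrm{gvol}(\overline{C})}{\mathrm{gvol}(V)} = \mathrm{bal}(C).$$

This together with the discussion on $M$, $N$ finishes the proof. □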


Thus the constrained normalized cut problem can be equivalently formulated as the combinatorial problem of minimizing $\hat{F}_\gamma$. In the next section we will show that this problem allows for a tight relaxation into a continuous optimization problem.
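The set functions of this section are easy to evaluate directly, which gives a way to check Theorem 2.1 by brute force on a toy graph. The following is a minimal Python sketch, not the authors' implementation: the function names, the 4-vertex example graph and its constraints are all illustrative assumptions, and the enumeration over all subsets is only feasible for very small n.

```python
import itertools
import numpy as np

def F_hat(W, b, Qm, Qc, C, gamma):
    """Set function (cut + gamma * (M_hat + N_hat)) / bal for a boolean mask C."""
    C = np.asarray(C, dtype=bool)
    cut = 2.0 * W[np.ix_(C, ~C)].sum()
    bal = 2.0 * b[C].sum() * b[~C].sum() / b.sum()
    return (cut + gamma * sum(violations(Qm, Qc, C))) / bal

def violations(Qm, Qc, C):
    """(M_hat, N_hat): twice the number of violated must-links / cannot-links."""
    C = np.asarray(C, dtype=bool)
    M_hat = 2.0 * Qm[np.ix_(C, ~C)].sum()
    N_hat = Qc[np.ix_(C, C)].sum() + Qc[np.ix_(~C, ~C)].sum()
    return M_hat, N_hat

def brute_force_min(W, b, Qm, Qc, gamma, consistent_only=False):
    """Minimize F_hat over all non-trivial subsets (exponential in n)."""
    best_val, best_C = np.inf, None
    for bits in itertools.product([False, True], repeat=len(b)):
        if not any(bits) or all(bits):
            continue  # skip the trivial partitions
        C = np.array(bits)
        if consistent_only and sum(violations(Qm, Qc, C)) > 0:
            continue
        val = F_hat(W, b, Qm, Qc, C, gamma)
        if val < best_val:
            best_val, best_C = val, C
    return best_val, best_C

# Hypothetical example: path graph 0-1-2-3 with a weak middle edge,
# vertex weights b = degrees (normalized cut case).
W = np.array([[0, 1, 0, 0],
              [1, 0, 0.1, 0],
              [0, 0.1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
b = W.sum(axis=1)
Qm = np.zeros((4, 4)); Qm[2, 3] = Qm[3, 2] = 1   # must-link (2, 3)
Qc = np.zeros((4, 4)); Qc[0, 1] = Qc[1, 0] = 1   # cannot-link (0, 1)

# lambda: NCut of a consistent partition (here the best one, found by enumeration)
lam, C_hard = brute_force_min(W, b, Qm, Qc, gamma=0.0, consistent_only=True)
gamma = b.sum() / 4.0 * lam                       # threshold of Theorem 2.1
val_soft, C_soft = brute_force_min(W, b, Qm, Qc, gamma)
```

On this example the unconstrained optimum {0, 1} violates the cannot-link, and with the chosen γ the unconstrained minimization of the penalized functional recovers the constrained optimum, as the theorem states.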


2.1 A tight continuous relaxation of $\hat{F}_\gamma$

Minimizing $\hat{F}_\gamma$ is a hard combinatorial problem. In the following, we derive an equivalent continuous optimization problem. Let $f \in \mathbb{R}^n$ denote a function on $V$, and $1_C$ denote the vector that is 1 on $C$ and 0 elsewhere. Define

$$M(f) := \sum_{i,j=1}^n q^m_{ij}\, |f_i - f_j| \quad \text{and} \quad N(f) := \mathrm{vol}(Q^c) \big( \max(f) - \min(f) \big) - \sum_{i,j=1}^n q^c_{ij}\, |f_i - f_j|,$$

where $\max(f)$ and $\min(f)$ are respectively the maximum and minimum elements of $f$. Note that $M(1_C) = \hat{M}(C)$ and $N(1_C) = \hat{N}(C)$ for any non-trivial¹ partition $(C, \overline{C})$.

Let $B$ denote the diagonal matrix with the vertex weights $b$ on the diagonal. We define

$$F_\gamma(f) := \frac{\sum_{i,j=1}^n w_{ij}\, |f_i - f_j| + \gamma M(f) + \gamma N(f)}{\big\| B \big( f - \frac{1}{\mathrm{gvol}(V)} \langle f, b \rangle 1 \big) \big\|_1}.$$

We denote the numerator of $F_\gamma(f)$ by $R_\gamma(f)$ and the denominator by $S(f)$.

Lemma 2.2. For any non-trivial partition it holds that $\hat{F}_\gamma(C) = F_\gamma(1_C)$.

¹ A partition $(C, \overline{C})$ is non-trivial if neither $C = \emptyset$ nor $C = V$.


From Lemma 2.2 it immediately follows that minimizing $F_\gamma$ is a relaxation of minimizing $\hat{F}_\gamma$. In our main result (Theorem 2.2), we establish that both problems are actually equivalent, so that we have a tight relaxation. In particular, a minimizer of $F_\gamma$ is an indicator function corresponding to the optimal partition minimizing $\hat{F}_\gamma$.


The proof is based on the following key property of the functional $F_\gamma$. Given any non-constant $f \in \mathbb{R}^n$, optimal thresholding,

$$C_f^* = \operatorname*{arg\,min}_{\min_i f_i \le t < \max_i f_i} \hat{F}_\gamma(C_f^t),$$

where $C_f^t = \{ i \in V \mid f_i > t \}$, yields an indicator function on some $C_f^* \subset V$ with smaller or equal value of $F_\gamma$.

Theorem 2.2. For $\gamma \ge 0$, we have

$$\min_{C \subset V} \hat{F}_\gamma(C) = \min_{f \in \mathbb{R}^n,\ f \text{ non-constant}} F_\gamma(f).$$

Moreover, a solution of the first problem can be obtained from the solution of the second problem.


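Proof. It has been shown in [8] that

$$\sum_{i,j=1}^n w_{ij}\, |f_i - f_j| = \int_{-\infty}^{\infty} \mathrm{cut}(C_f^t, \overline{C_f^t})\, dt.$$

We define $\hat{P} : 2^V \to \mathbb{R}$ as $\hat{P}(C) = 1$ if $C \ne V$ and $C \ne \emptyset$, and 0 otherwise. Denoting by $\mathrm{cut}_Q(C, \overline{C})$ the cut on the constraint graph whose weight matrix is given by $Q$, and by $\hat{R}_\gamma(C)$ the numerator of $\hat{F}_\gamma(C)$, we have

$$\begin{aligned}
R_\gamma(f) &= \int_{-\infty}^{\infty} \mathrm{cut}(C_f^t, \overline{C_f^t})\, dt + \gamma \int_{-\infty}^{\infty} \mathrm{cut}_{Q^m}(C_f^t, \overline{C_f^t})\, dt + \gamma\, \mathrm{vol}(Q^c) \int_{\min_i f_i}^{\max_i f_i} 1\, dt - \gamma \int_{-\infty}^{\infty} \mathrm{cut}_{Q^c}(C_f^t, \overline{C_f^t})\, dt \\
&= \int_{-\infty}^{\infty} \mathrm{cut}(C_f^t, \overline{C_f^t})\, dt + \gamma \int_{-\infty}^{\infty} \mathrm{cut}_{Q^m}(C_f^t, \overline{C_f^t})\, dt + \gamma\, \mathrm{vol}(Q^c) \int_{-\infty}^{\infty} \hat{P}(C_f^t)\, dt - \gamma \int_{-\infty}^{\infty} \mathrm{cut}_{Q^c}(C_f^t, \overline{C_f^t})\, dt \\
&= \int_{-\infty}^{\infty} \hat{R}_\gamma(C_f^t)\, dt.
\end{aligned}$$
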


Constrained 1-Spectral Clustering 


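Note that $S(f)$ is an even, convex and positively one-homogeneous function.² Moreover, every even, convex, positively one-homogeneous function $T : \mathbb{R}^n \to \mathbb{R}$ has the form $T(f) = \sup_{u \in U} \langle u, f \rangle$, where $U$ is a symmetric convex set. Note that $S(1) = 0$ and thus, because of the symmetry of $U$, it has to hold that $\langle u, 1 \rangle = 0$ for all $u \in U$. Since $S(1_{C_f^t}) = \hat{S}(C_f^t)$, where $\hat{S}(C) = \mathrm{bal}(C)$ is the denominator of $\hat{F}_\gamma(C)$, and $\langle u, f \rangle \le S(f)$ for $u \in U$, we have for all $u \in U$,

$$R_\gamma(f) = \int_{-\infty}^{\infty} \hat{R}_\gamma(C_f^t)\, dt \ge \inf_t \hat{F}_\gamma(C_f^t) \int_{-\infty}^{\infty} \hat{S}(C_f^t)\, dt \ge \inf_t \hat{F}_\gamma(C_f^t) \int_{\min_i f_i}^{\max_i f_i} \langle u, 1_{C_f^t} \rangle\, dt, \qquad (2)$$

where the infimum is taken over $\min_i f_i \le t < \max_i f_i$ and in the last inequality we changed the limits of integration using the fact that $\langle u, 1 \rangle = 0$. Assume without loss of generality that $f$ is ordered increasingly, and let $C_i := C_f^{f_i}$ and $C_0 = V$. Then

$$\int_{\min_i f_i}^{\max_i f_i} \langle u, 1_{C_f^t} \rangle\, dt = \sum_{i=1}^{n-1} \langle u, 1_{C_i} \rangle (f_{i+1} - f_i) = \sum_{i=1}^{n} f_i \big( \langle u, 1_{C_{i-1}} \rangle - \langle u, 1_{C_i} \rangle \big) = \sum_{i=1}^{n} f_i u_i = \langle u, f \rangle.$$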

Proof of Theorem 2.3. From Theorem 2.1 we know that, for the chosen value of $\gamma$, the constrained problem is equivalent to

$$\min_{C \subset V} \hat{F}_\gamma(C),$$

which in turn is equivalent, by Theorem 2.2, to the right problem in the statement. Moreover, as shown in Theorem 2.2, the minimizer of $F_\gamma$ is an indicator function on $C \subset V$ and hence we immediately get an optimal partition of the constrained problem. □


A few comments on the implications of Theorem 2.3. First, it shows that the constrained normalized cut problem can be equivalently solved by minimizing $F_\gamma(f)$ for the given value of $\gamma$. The value of $\gamma$ depends on the normalized cut value of a partition consistent with the given constraints. Note that such a partition can be obtained in polynomial time by 2-coloring the constraint graph as long as the constraints are consistent.
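The 2-coloring mentioned above can be sketched as a breadth-first search in which a must-link forces equal labels and a cannot-link forces different labels. This is an illustrative Python sketch under our own assumptions (constraints given as lists of index pairs, hypothetical function name), not code from the paper.

```python
from collections import deque

def consistent_partition(n, must_links, cannot_links):
    """Return a list of 0/1 labels satisfying all constraints,
    or None if the constraints are inconsistent."""
    adj = [[] for _ in range(n)]
    for i, j in must_links:       # flip = 0: same side
        adj[i].append((j, 0))
        adj[j].append((i, 0))
    for i, j in cannot_links:     # flip = 1: opposite sides
        adj[i].append((j, 1))
        adj[j].append((i, 1))
    label = [None] * n
    for start in range(n):
        if label[start] is not None:
            continue
        label[start] = 0          # free choice per connected component
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j, flip in adj[i]:
                want = label[i] ^ flip
                if label[j] is None:
                    label[j] = want
                    queue.append(j)
                elif label[j] != want:
                    return None   # odd cycle of cannot-links: inconsistent
    return label

labels_ok = consistent_partition(4, [(2, 3)], [(0, 1)])
labels_bad = consistent_partition(3, [], [(0, 1), (1, 2), (0, 2)])
```

Each connected component of the constraint graph has one free label, so the result is one of several consistent labelings. If there are no cannot-links at all, the sketch returns the trivial all-zero labeling, which would have to be replaced by a non-trivial one before computing its normalized cut.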


2.2 Integration of must-link constraints via 
sparsification 


Proof of Theorem 2.2 (continued). Noting that (2) holds for all $u \in U$, we have

$$R_\gamma(f) \ge \inf_t \hat{F}_\gamma(C_f^t)\, \sup_{u \in U} \langle u, f \rangle = \inf_t \hat{F}_\gamma(C_f^t)\, S(f).$$

This implies that

$$F_\gamma(f) \ge \inf_t \hat{F}_\gamma(C_f^t) = F_\gamma(1_{C_f^*}), \qquad (3)$$

where $C_f^* = \operatorname*{arg\,min}_{\min_i f_i \le t < \max_i f_i} \hat{F}_\gamma(C_f^t)$.

This shows that we always get descent by optimal thresholding. Thus the actual minimizer of $F_\gamma$ is a two-valued function, which can be transformed to an indicator function on some $C \subset V$, because of the scale and shift invariance of $F_\gamma$. Then from Lemma 2.2, which shows that for non-trivial partitions $\hat{F}_\gamma(C) = F_\gamma(1_C)$, the statement follows. □
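Optimal thresholding is straightforward to implement: evaluate the set function on every level set of f and keep the best one. The Python sketch below is illustrative only. It assumes the continuous functional is the sum of weighted absolute differences plus the two penalty terms, divided by the weighted deviation of f from its b-weighted mean (the reading of the definition in Section 2.1 that is consistent with the definition of bal), and it uses a hypothetical 4-vertex example with γ = 1.

```python
import numpy as np

def F_hat(W, b, Qm, Qc, C, gamma):
    """Set function (cut + gamma * (M_hat + N_hat)) / bal for a boolean mask C."""
    C = np.asarray(C, dtype=bool)
    cut = 2.0 * W[np.ix_(C, ~C)].sum()
    bal = 2.0 * b[C].sum() * b[~C].sum() / b.sum()
    M_hat = 2.0 * Qm[np.ix_(C, ~C)].sum()
    N_hat = Qc[np.ix_(C, C)].sum() + Qc[np.ix_(~C, ~C)].sum()
    return (cut + gamma * (M_hat + N_hat)) / bal

def F_cont(W, b, Qm, Qc, f, gamma):
    """Continuous functional evaluated at a non-constant vector f."""
    D = np.abs(f[:, None] - f[None, :])          # |f_i - f_j|
    M = (Qm * D).sum()
    N = Qc.sum() * (f.max() - f.min()) - (Qc * D).sum()
    S = np.abs(b * (f - f @ b / b.sum())).sum()  # ||B(f - <f,b>/gvol(V) 1)||_1
    return ((W * D).sum() + gamma * (M + N)) / S

def optimal_threshold(W, b, Qm, Qc, f, gamma):
    """Best level set C_t = {i : f_i > t} over min f <= t < max f."""
    best_val, best_C = np.inf, None
    for t in np.unique(f)[:-1]:                  # distinct values except the maximum
        C = f > t
        val = F_hat(W, b, Qm, Qc, C, gamma)
        if val < best_val:
            best_val, best_C = val, C
    return best_val, best_C

# Hypothetical example: path graph 0-1-2-3, must-link (2, 3), cannot-link (0, 1).
W = np.array([[0, 1, 0, 0],
              [1, 0, 0.1, 0],
              [0, 0.1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
b = W.sum(axis=1)
Qm = np.zeros((4, 4)); Qm[2, 3] = Qm[3, 2] = 1
Qc = np.zeros((4, 4)); Qc[0, 1] = Qc[1, 0] = 1
gamma = 1.0
f = np.array([0.0, 0.2, 0.9, 1.0])
val, C = optimal_threshold(W, b, Qm, Qc, f, gamma)
```

For this f the thresholded set is {2, 3}, whose value is below the value of the continuous functional at f, in line with inequality (3). With this small γ the returned set still violates the cannot-link, which is the soft trade-off at work.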


If the must-link constraints are reliable and therefore should be enforced, one can directly integrate them by merging the corresponding vertices together with a re-definition of edge and vertex weights. In this way one derives a new reduced graph, in which the value of the normalized cut of all partitions that satisfy the must-link constraints is preserved.

The construction of a reduced graph is given below for a must-link constraint $(p, q)$.

1. Merge $p$ and $q$ into a single vertex $\tau$.

2. Update the vertex weight of $\tau$ by $b_\tau = b_p + b_q$.

3. Update the edges as follows: if $r$ is any vertex other than $p$ and $q$, then add an edge between $\tau$ and $r$ with weight $w(p, r) + w(q, r)$.


Now, we state our second result: the problem of minimizing the functional $F_\gamma$ over arbitrary real-valued non-constant $f$, for a particular choice of $\gamma$, is in fact equivalent to the NP-hard problem of minimizing normalized cut with constraints.

Theorem 2.3. Let $(C, \overline{C})$ be consistent and $\lambda = \mathrm{NCut}(C, \overline{C})$. Then for $\gamma \ge \frac{\mathrm{gvol}(V)}{4} \lambda$ it holds that

$$\min_{C \subset V:\ (C, \overline{C}) \text{ consistent}} \mathrm{NCut}(C, \overline{C}) = \min_{f \in \mathbb{R}^n,\ f \text{ non-constant}} F_\gamma(f).$$

Furthermore, an optimal partition of the constrained problem can be obtained from a minimizer of the right problem.

² A function $S : \mathbb{R}^V \to \mathbb{R}$ is positively one-homogeneous if $S(\alpha f) = \alpha S(f)$ for all $\alpha > 0$.


Note that this construction leads to a graph with vertex weights even if the original graph had vertex weights equal to 1. If there are many must-links, one can efficiently integrate all of them together by first constructing the must-link constraint graph and merging each connected component in this way.
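The merging of connected components of the must-link graph can be sketched with a union-find structure and an assignment matrix. The Python code below is a minimal illustration under our own assumptions (dense NumPy matrices, must-links as index pairs, hypothetical function name); a large-scale implementation would use sparse matrices.

```python
import numpy as np

def merge_must_links(W, b, must_links):
    """Merge every connected component of the must-link graph into one vertex.

    Returns (W_red, b_red, comp), where comp[i] is the index of the reduced
    vertex that contains the original vertex i."""
    n = len(b)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, j in must_links:
        parent[find(i)] = find(j)

    roots = sorted({find(i) for i in range(n)})
    index = {r: k for k, r in enumerate(roots)}
    comp = np.array([index[find(i)] for i in range(n)])

    P = np.zeros((n, len(roots)))
    P[np.arange(n), comp] = 1.0             # P[i, k] = 1 iff vertex i is in component k
    W_red = P.T @ W @ P                     # sums edge weights between components
    np.fill_diagonal(W_red, 0.0)            # edges inside a merged vertex never enter a cut
    b_red = P.T @ b                         # vertex weights add up: b_tau = b_p + b_q
    return W_red, b_red, comp

# Hypothetical example: path graph 0-1-2-3 with must-link (2, 3).
W = np.array([[0, 1, 0, 0],
              [1, 0, 0.1, 0],
              [0, 0.1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
b = W.sum(axis=1)
W_red, b_red, comp = merge_must_links(W, b, [(2, 3)])
```

Note that the vertex weights of the reduced graph are the sums of the original vertex weights, not the degrees of the reduced graph; this is what keeps the generalized volumes, and hence the normalized cut, unchanged.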

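The following lemma shows that the above construction preserves all normalized cuts which respect the must-link constraints. We prove it for the simple case where we merge $p$ and $q$; the proof can easily be extended to the general case by induction.

Lemma 2.3. Let $G'(V', E', w', b')$ be the reduced graph of $G(V, E, w, b)$ obtained by merging vertices $p$ and $q$. If a partition $(C, \overline{C})$ does not separate $p$ and $q$, we have $\mathrm{NCut}_G(C, \overline{C}) = \mathrm{NCut}_{G'}(C', \overline{C'})$.

Proof. Note that $\mathrm{gvol}(V') = \mathrm{gvol}(\tau) + \mathrm{gvol}(V \setminus \{p, q\}) = \mathrm{gvol}(p) + \mathrm{gvol}(q) + \mathrm{gvol}(V \setminus \{p, q\}) = \mathrm{gvol}(V)$. If $(C, \overline{C})$ does not separate $p$ and $q$, then we have either $\{p, q\} \subset C$ or $\{p, q\} \subset \overline{C}$. W.l.o.g. assume that $\{p, q\} \subset C$. The corresponding partition of $V'$ is then $C' = \tau \cup (C \setminus \{p, q\})$ and $\overline{C'} = \overline{C}$. We get

$$\begin{aligned}
\mathrm{cut}(C', \overline{C'}) &= 2 \sum_{i \in C',\, j \in \overline{C'}} w'_{ij} = 2 \sum_{j \in \overline{C'}} w'_{\tau j} + 2 \sum_{i \in C' \setminus \tau,\, j \in \overline{C'}} w'_{ij} \\
&= 2 \sum_{k \in \{p, q\},\, j \in \overline{C}} w_{kj} + 2 \sum_{i \in C \setminus \{p, q\},\, j \in \overline{C}} w_{ij} = \mathrm{cut}(C, \overline{C}),
\end{aligned}$$

$$\mathrm{gvol}(C') = \sum_{i \in C'} b'_i = b'_\tau + \sum_{i \in C' \setminus \tau} b'_i = b_p + b_q + \sum_{i \in C \setminus \{p, q\}} b_i = \sum_{i \in C} b_i = \mathrm{gvol}(C),$$

$$\mathrm{gvol}(\overline{C'}) = \sum_{i \in \overline{C'}} b'_i = \sum_{i \in \overline{C}} b_i = \mathrm{gvol}(\overline{C}).$$

Thus we have $\mathrm{NCut}_G(C, \overline{C}) = \mathrm{NCut}_{G'}(C', \overline{C'})$. □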


All partitions of the reduced graph fulfill all must-link constraints and thus any relaxation of the unconstrained normalized cut problem can now be used. Moreover, this is not restricted to the cut criterion we are using: any other graph cut criterion based on the cut and the volume of the subsets is preserved under the reduction.


3 Algorithm for Constrained 
1-Spectral Clustering 

In this section we discuss the efficient minimization of $F_\gamma$ based on recent ideas from unconstrained 1-spectral clustering. Note that $F_\gamma$ is a non-negative ratio of a difference of convex (d.c.) functions and a convex function, both of which are positively one-homogeneous. In recent work [9], a general scheme, shown in Algorithm 1 (where $\partial S(f)$ denotes the subdifferential of the convex function $S$ at $f$), is proposed for the minimization of a non-negative ratio of a d.c. function and a convex function, both of which are positively one-homogeneous.

It is shown in [9] that Algorithm 1 generates a sequence $f^k$ such that either $F_\gamma(f^{k+1}) < F_\gamma(f^k)$ or the sequence terminates. Moreover, the cluster points of $f^k$ correspond to critical points of $F_\gamma$. The scheme is given in Algorithm 1 for the problem $\min_{f \in \mathbb{R}^n} (R_1(f) - R_2(f)) / S(f)$, where

$$R_1(f) := \frac{1}{2} \sum_{i,j=1}^n (w_{ij} + \gamma q^m_{ij})\, |f_i - f_j| + \frac{\gamma}{2} \sum_{i,j=1}^n q^c_{ij} \big( \max(f) - \min(f) \big),$$

$$R_2(f) := \frac{\gamma}{2} \sum_{i,j=1}^n q^c_{ij}\, |f_i - f_j|,$$

$$S(f) := \frac{1}{2} \Big\| B \Big( f - \frac{1}{\mathrm{gvol}(V)} \langle f, b \rangle 1 \Big) \Big\|_1.$$

Note that $R_1$, $R_2$ are both convex functions and $F_\gamma(f) = (R_1(f) - R_2(f)) / S(f)$, since the common factor $\frac{1}{2}$ cancels in the ratio.


Algorithm 1 Minimization of a ratio $(R_1(f) - R_2(f)) / S(f)$ where $R_1, R_2, S$ are convex and positively one-homogeneous

1: Initialization: $f^0$ = random with $\|f^0\| = 1$, $\lambda^0 = (R_1(f^0) - R_2(f^0)) / S(f^0)$

2: repeat

3:   $f^{k+1} = \operatorname*{arg\,min}_{\|f\|_2 \le 1} \{ R_1(f) - \langle f, r_2 \rangle - \lambda^k \langle f, s \rangle \}$, where $r_2 \in \partial R_2(f^k)$, $s \in \partial S(f^k)$

4:   $\lambda^{k+1} = (R_1(f^{k+1}) - R_2(f^{k+1})) / S(f^{k+1})$

5: until $\frac{|\lambda^{k+1} - \lambda^k|}{\lambda^k} < \epsilon$

6: Output: $\lambda^{k+1}$ and $f^{k+1}$.


Moreover, it is shown in [9] that if one wants to minimize $(R_1(f) - R_2(f)) / S(f)$ only over non-constant functions, one has to ensure that $\langle r_2, 1 \rangle = \langle s, 1 \rangle = 0$. Note that

$$\partial R_2(f) = \Big\{ \gamma\, r_2 \ \Big|\ (r_2)_i = \sum_{j=1}^n q^c_{ij} u_{ij},\ u_{ij} = -u_{ji},\ u_{ij} \in \mathrm{sign}(f_i - f_j) \Big\},$$

where $\mathrm{sign}(x) = [-1, 1]$ if $x = 0$; otherwise it is just the sign function. It is easy to check that $\langle u, 1 \rangle = 0$ for all $u \in \partial S(f)$ and all $f \in \mathbb{R}^n$, and that for all $f \in \mathbb{R}^n$ there always exists a vector $u \in \partial R_2(f)$ such that $\langle u, 1 \rangle = 0$.

In the algorithm the key part is the inner convex problem which one has to solve at each step. In our case it has the form

$$\min_{\|f\|_2 \le 1}\ \frac{1}{2} \sum_{i,j=1}^n (w_{ij} + \gamma q^m_{ij})\, |f_i - f_j| + \frac{\gamma}{2} \sum_{i,j=1}^n q^c_{ij} \big( \max(f) - \min(f) \big) - \langle f, \gamma r_2 + \lambda^k s \rangle, \qquad (4)$$

where $\gamma r_2 \in \partial R_2(f^k)$, $s \in \partial S(f^k)$ and $\lambda^k = F_\gamma(f^k)$.

To solve it more efficiently we derive an equivalent smooth dual formulation for this non-smooth convex problem. We replace $w_{ij} + \gamma q^m_{ij}$ by $w'_{ij}$ in the following.

Lemma 3.1. Let $E \subset V \times V$ denote the set of edges and $A : \mathbb{R}^E \to \mathbb{R}^V$ be defined as $(A\alpha)_i = \sum_{j : (i,j) \in E} w'_{ij} \alpha_{ij}$. Moreover, let $U$ denote the simplex, $U = \{ u \in \mathbb{R}^n \mid \sum_{i=1}^n u_i = 1,\ u_i \ge 0\ \forall i \}$. The above inner problem is equivalent to

$$\min_{\{\alpha \in \mathbb{R}^E \,:\, \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\},\ v \in U} \Psi(\alpha, v) := c\, \Big\| -\frac{A\alpha}{c} + v + b - P_U \Big( -\frac{A\alpha}{c} + v + b \Big) \Big\|_2, \qquad (5)$$

where $c = \frac{\gamma}{2} \mathrm{vol}(Q^c)$, $b = \frac{\gamma r_2 + \lambda^k s}{c}$ and $P_U(x)$ is the projection of $x$ onto the simplex $U$.


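Proof. Noting that $\frac{1}{2} \sum_{i,j=1}^n w'_{ij}\, |f_i - f_j| = \max_{\{\alpha \in \mathbb{R}^E \,:\, \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\}} \langle f, A\alpha \rangle$ (see [5]) and $\max_i f_i = \max_{u \in U} \langle u, f \rangle$, the inner problem can be rewritten as

$$\begin{aligned}
&\min_{\|f\|_2 \le 1}\ \max_{\alpha} \langle f, A\alpha \rangle + c \max_{u \in U} \langle f, u \rangle + c \max_{v \in U} \langle -f, v \rangle - \gamma \langle f, r_2 \rangle - \lambda^k \langle f, s \rangle \\
&= \min_{\|f\|_2 \le 1}\ \max_{\alpha,\ u, v \in U} \langle f, A\alpha \rangle + c \langle f, u \rangle - c \langle f, v \rangle - \gamma \langle f, r_2 \rangle - \lambda^k \langle f, s \rangle \\
&\overset{s_1}{=} \max_{\alpha,\ u, v \in U}\ \min_{\|f\|_2 \le 1} \langle f, A\alpha + c(u - v) - \gamma r_2 - \lambda^k s \rangle \\
&\overset{s_2}{=} \max_{\alpha,\ u, v \in U} - \| A\alpha - \gamma r_2 - \lambda^k s + c(u - v) \|_2 \\
&= - \min_{\alpha,\ u, v \in U} \tilde{\Psi}(\alpha, u, v),
\end{aligned}$$

where every maximum and minimum over $\alpha$ is taken over $\{\alpha \in \mathbb{R}^E : \|\alpha\|_\infty \le 1,\ \alpha_{ij} = -\alpha_{ji}\}$ and $\tilde{\Psi}(\alpha, u, v) := \| A\alpha - \gamma r_2 - \lambda^k s + c(u - v) \|_2$.

Step $s_1$ follows from the standard min-max theorem (see Corollary 37.3.2 in Rockafellar's Convex Analysis), since $u$, $v$, $\alpha$ and $f$ lie in non-empty compact convex sets. In step $s_2$, we used that the minimizer of the linear function over the Euclidean ball is given by

$$f^* = - \frac{A\alpha - \gamma r_2 - \lambda^k s + cu - cv}{\| A\alpha - \gamma r_2 - \lambda^k s + cu - cv \|_2}$$

if $\| A\alpha - \gamma r_2 - \lambda^k s + cu - cv \|_2 \ne 0$; otherwise $f^*$ is an arbitrary element of the Euclidean unit ball.

Finally, we have $\| A\alpha - \gamma r_2 - \lambda^k s + cu - cv \|_2 = c\, \| \frac{A\alpha}{c} + u - v - b \|_2$. We also know that for a convex set $C$ and any given $y$, $\min_{x \in C} \|x - y\| = \|y - P_C(y)\|$, where $P_C(y)$ is the projection of $y$ onto the set $C$. With $y = -\frac{A\alpha}{c} + v + b$, we have for any $\alpha$, $\min_{u, v \in U} \tilde{\Psi}(\alpha, u, v) = \min_{v \in U} \min_{u \in U} c\, \|u - y\| = \min_{v \in U} c\, \|y - P_U(y)\|$, and from this the result follows. □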


The smooth dual problem can be solved efficiently using first-order projected gradient methods like FISTA [2], which has a guaranteed convergence rate of $O(L / k^2)$, where $k$ is the number of steps and $L$ is the Lipschitz constant of the gradient of the objective. The bound on the Lipschitz constant for the gradient of the objective in (5) can be rather loose if the weights vary a lot. The rescaling of the variable $\alpha$ introduced in Lemma 3.2 leads to a better condition number and also to a tighter bound on the Lipschitz constant. This results in a significant improvement in practical performance.
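Every evaluation of the dual objective and its gradient requires the projection onto the simplex U. A common way to compute it is the sort-based method sketched below in Python. This is an illustrative sketch, not the authors' code; the helper `dual_residual` additionally assumes the smooth form of the objective, half the squared distance of d to the simplex, whose gradient with respect to d is the residual d minus its projection.

```python
import numpy as np

def project_simplex(x):
    """Euclidean projection of x onto {u : u_i >= 0, sum_i u_i = 1}."""
    x = np.asarray(x, dtype=float)
    u = np.sort(x)[::-1]                      # sort in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(x) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index with positive gap
    tau = css[rho] / (rho + 1)                # shift that makes the result sum to 1
    return np.maximum(x - tau, 0.0)

def dual_residual(d):
    """Value 0.5 * ||d - P_U(d)||^2 and its gradient with respect to d
    (assumed smooth form of the dual objective)."""
    r = d - project_simplex(d)
    return 0.5 * float(r @ r), r

p1 = project_simplex([2.0, 0.0])
p2 = project_simplex([0.2, 0.2, 0.2])
value, grad = dual_residual(np.array([2.0, 0.0]))
```

The sort dominates the cost, so one projection takes O(n log n) time, which is cheap compared with the sparse matrix-vector products in each gradient step.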

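Lemma 3.2. Let $B$ be a linear operator defined as $(B\beta)_i := \sum_{j : (i,j) \in E} \beta_{ij}$ and let $s_{ij} = \frac{w'_{ij}}{c} M$ for a positive constant $M \ge \|B\|$. The above inner problem is equivalent to

$$\min_{\{\beta \in \mathbb{R}^E \,:\, |\beta_{ij}| \le s_{ij},\ \beta_{ij} = -\beta_{ji}\},\ v \in U} \Psi(\beta, v) := \frac{1}{2} \| d - P_U(d) \|_2^2,$$

where $d = -\frac{1}{M} B\beta + v + b$. The Lipschitz constant of the gradient of $\Psi$ is upper bounded by 4.

Proof. Let $\beta_{ij} = s_{ij} \alpha_{ij}$. Then $(A\alpha)_i = \sum_{j : (i,j) \in E} w'_{ij} \alpha_{ij} = \sum_{j : (i,j) \in E} \frac{w'_{ij}}{s_{ij}} \beta_{ij} = \frac{c}{M} (B\beta)_i$, and the constraints on $\alpha$ transform to $-s_{ij} \le \beta_{ij} \le s_{ij}$ and $\beta_{ij} = -\beta_{ji}$. Since the mapping between $\alpha$ and $\beta$ is one-to-one, the transformation yields an equivalent problem (in the sense that a minimizer of one problem can easily be derived from a minimizer of the other problem).

Now we derive a bound on the Lipschitz constant. The gradients of $\Psi$ at $x = (\beta, v)$ with respect to $\beta$ and $v$ are given by

$$(\nabla \Psi(x))_\beta = -\frac{1}{M} B^T (d - P_U(d)), \qquad (\nabla \Psi(x))_v = d - P_U(d),$$

where $B^T$ is the adjoint operator of $B$ given by $(B^T d)_{ij} = (d_i - d_j)$.

Let $x' = (\beta', v')$ denote any other point and $d' = -\frac{1}{M} B\beta' + v' + b$. Then, using that the projection $P_U$ is non-expansive, we have

$$\begin{aligned}
\| \nabla \Psi(x) - \nabla \Psi(x') \|^2 &\le \frac{\|B^T\|^2}{M^2} \| (d - P_U(d)) - (d' - P_U(d')) \|^2 + \| (d - P_U(d)) - (d' - P_U(d')) \|^2 \\
&= \Big( 1 + \frac{\|B^T\|^2}{M^2} \Big) \| (d - d') + (-P_U(d) + P_U(d')) \|^2 \\
&\le 2 \Big( 1 + \frac{\|B\|^2}{M^2} \Big) \big( \|d - d'\|^2 + \| -P_U(d) + P_U(d') \|^2 \big) \\
&\le 4 \Big( 1 + \frac{\|B\|^2}{M^2} \Big) \|d - d'\|^2 \\
&= 4 \Big( 1 + \frac{\|B\|^2}{M^2} \Big) \Big\| -\frac{1}{M} B(\beta - \beta') + (v - v') \Big\|^2 \\
&\le 8 \Big( 1 + \frac{\|B\|^2}{M^2} \Big) \Big( \frac{\|B\|^2}{M^2} \|\beta - \beta'\|^2 + \|v - v'\|^2 \Big) \le 16\, \|x - x'\|^2,
\end{aligned}$$

where the last step uses $M \ge \|B\|$. □

Figure 1: Influence of $\gamma$ on cut and clustering error.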


Proof of Theorem 3.1. Algorithm 1 generates $\{f^k\}$ such that $F_\gamma(f^{k+1}) < F_\gamma(f^k)$ until it terminates [9]. Hence $F_\gamma(f^1) < F_\gamma(f^0) = \hat{F}_\gamma(C)$ if the algorithm does not stop in one step. As shown in Theorem 2.2, optimal thresholding of $f^1$ results in a partition $(A, \overline{A})$ such that $\hat{F}_\gamma(A) \le F_\gamma(f^1) < \hat{F}_\gamma(C)$. If $C$ is consistent, we have $F_\gamma(f^0) = \hat{F}_\gamma(C) = \mathrm{NCut}(C, \overline{C}) = \lambda$ by Lemma 2.2. For the chosen value of $\gamma$, using a similar argument as in Lemma 2.1, one sees that for any inconsistent subset $B$, $\hat{F}_\gamma(B) > \lambda$, and hence $A$ is consistent with $Q$. Then it is immediate that $\mathrm{NCut}(A, \overline{A}) = \hat{F}_\gamma(A) \le F_\gamma(f^1) < F_\gamma(f^0) = \mathrm{NCut}(C, \overline{C})$. □

In practice, the best results can be obtained by first minimizing $F_\gamma$ for $\gamma = 0$ (unconstrained problem) and then increasing the value of $\gamma$, using the previously obtained clustering as initialization. This process is iterated until the current partition violates no more than a given number of constraints.

4 Soft- versus Hard-Constrained 
Normalized Cut Problem 


where neigh(r) is the number of neighbors of vertex r. 

Although the problem of minimizing $F_\gamma$ is non-convex and thus global convergence is not guaranteed, Algorithm 1 has the following quality guarantee.

Theorem 3.1. Let $(C, \overline{C})$ be any partition and let $\lambda = \mathrm{NCut}(C, \overline{C})$. If one uses $1_C$ as the initialization of Algorithm 1, then the algorithm either terminates in one step or outputs an $f^1$ which yields a partition $(A, \overline{A})$ such that

$$\hat{F}_\gamma(A) < \hat{F}_\gamma(C).$$

Moreover, if $(C, \overline{C})$ is consistent and if we set $\gamma$ to any value larger than $\frac{\mathrm{gvol}(V)}{4} \lambda$, then $A$ is also consistent and $\mathrm{NCut}(A, \overline{A}) < \mathrm{NCut}(C, \overline{C})$.

The need for a soft version arises, for example, if the constraints are noisy or inconsistent. Moreover, as we illustrate in the next section, we use the soft version to extend our clustering method to the multi-partitioning problem. Using the bound of Lemma 2.1 for $\gamma$, we can solve the soft-constrained problem for any given number of violations.

It appears from a theoretical point of view that, due to noise, satisfying all constraints should not be the best choice. However, in our experiments it turned out that typically the best results were achieved when all constraints were satisfied. We illustrate this behavior for the dataset Sonar, where we generated 80 constraints and increased $\gamma$ from zero until all constraints were satisfied. In Figure 1 we plot cuts and errors versus the number of violated constraints. One observes that the best error is obtained when all constraints are satisfied. Since by always enforcing all given constraints our method becomes parameter-free (we increase $\gamma$ until all constraints are satisfied), we chose this option for the experiments.












5 Multi-Partitioning with Constraints 

In this section we present a method to integrate constraints in a multi-partitioning setting. In the multi-partitioning problem, one seeks a k-partitioning $(C_1, \ldots, C_k)$ of the graph such that the normalized multi-cut given by

$$\sum_{i=1}^{k} \mathrm{NCut}(C_i, \overline{C_i}) \qquad (6)$$

is minimized. A straightforward way to generate a multi-partitioning is to use a recursive bi-partitioning scheme. Starting with all points as the initial partition, the method repeats the following steps until the current partition has k components.

1. Split each of the components in the current partition into two parts.

2. Choose among the above splits the one minimizing the multi-cut criterion.

Now we extend this method to the constrained case. Note that it is always possible to perform a binary split which satisfies all must-link constraints. Thus, must-link constraints pose no difficulty in the multi-partitioning scheme, as all must-link constraints can be integrated using the procedure given in Section 2.2.

However, satisfying all cannot-link constraints is sometimes not possible (cyclic constraints) and usually also not desirable at each level of the recursive bi-partitioning, since an early binary split cannot separate all classes. The issue here is which cannot-link constraints should be considered for the binary split in step 1.

To address this issue, we use the soft version of our formulation, where we only need to specify the maximum number, $l$, of violations allowed. We derive this number $l$ assuming the following simple uniform model of the data and constraints. We assume that all classes have equal size and that there is an equal number of cannot-link constraints between all pairs of classes. Assuming that any binary split does not destroy the class structure, the maximum number of violations is obtained if one class is separated from the rest. Precisely, the expected value of this number, given $N$ cannot-link constraints and $k$ classes, is $N (1 - \frac{2}{k})$, since under this model the violated constraints are exactly those between the $\frac{(k-1)(k-2)}{2}$ pairs of classes that remain together, out of $\frac{k(k-1)}{2}$ pairs in total. In the first binary split, these numbers ($N$ and $k$) are known. In the successive binary splits, $N$ is known, while $k$ can again be derived, assuming the uniform model, from the size $n$ of the current component.

Figure 2: Left: ground truth, middle: clustering obtained by unconstrained 1-spectral clustering, right: clustering obtained by the constrained version.

We illustrate our approach using an artificial dataset (mixture of Gaussians, 500 points, 2 dimensions). Figure 2 shows on the left the ground truth and in the middle the solution of unconstrained ($\gamma = 0$) multi-partitioning. In the unconstrained solution, points belonging to the same class are split into two clusters while points from two other classes are merged into a single cluster. On the right, the result of our constrained multi-partitioning framework with 80 randomly generated constraints is shown.

6 Experiments 

We compare our method against the following four related constrained clustering approaches: Spectral Learning (SL) [11], Flexible Constrained Spectral Clustering (CSP), Constrained Clustering via Spectral Regularization (CCSR) [12] and Spectral Clustering with Linear Constraints (SCLC) [19]. SL integrates the constraints by simply modifying the weight matrix such that the edges connecting must-links have maximum weight and the edges of cannot-links have zero weight. CSP starts from the spectral relaxation and restricts the space of feasible solutions to those that satisfy a certain amount (specified by the user) of constraints. This amounts to solving a full generalized eigenproblem and choosing, among the eigenvectors corresponding to positive eigenvalues, the one that has minimum cost. CCSR addresses the problem of incorporating the constraints in the multi-class problem directly by an SDP which aims at adapting the spectral embedding to be consistent with the constraint information. For CSP and CCSR we use the code provided by the authors on their webpages.

In SCLC one solves the spectral relaxation of the normalized cut problem subject to linear constraints. Cannot-links and must-links are encoded via linear constraints as follows [5]: if the vertices $p$ and $q$ cannot-link (resp. must-link) then add a constraint $f_p = -f_q$ (resp. $f_p = f_q$). Although must-links are correctly formulated, one can argue that the encoding of cannot-links has modeling drawbacks. First observe that any solution that assigns zero to the constrained vertices $p$ and $q$ still satisfies the corresponding cannot-link constraint although it is not feasible for the constrained cut problem. Moreover, one can observe from the derivation of the spectral relaxation [16] that vertices belonging to different components need to have only different signs but not the same magnitude. Encoding cannot-links this way introduces a bias towards partitions of equal volume, which can be observed in the experiments.

Dataset        Size     Features   Classes
Sonar          208      60         2
Spam           4207     57         2
USPS           9298     256        10
Shuttle        58000    9          7
MNIST (Ext)    630000   784        10

Table 1: UCI datasets. The extended MNIST dataset is generated by translating each original input image of MNIST by one pixel, i.e., 8 directions.

Our evaluation is based on three criteria: clustering error, normalized cut and fraction of constraints violated. For the clustering error we take the known labels and classify each cluster using majority vote. In this way each point is assigned a label and the clustering error is the error of this labeling. We use this measure as it is the expected error one would obtain when using simple semi-supervised learning, where one labels each cluster using majority vote.

The summary of the datasets considered is given in Table 1. Data points with missing values are removed and the k-NN similarity graph is constructed from the remaining data as in [3]. In order to illustrate the performance in the case of highly unbalanced problems, we create a binary problem (digit 0 versus rest) from USPS. The constraint pairs are generated in the following manner. We randomly sample pairs of points and, for each pair, we introduce a cannot-link or must-link constraint based on the labels of the sampled pair. The results, averaged over 10 trials, are shown in Table 2 for 2-class problems and in Table 3 for multi-class problems.³ In the plots our method is denoted as COSC and we always enforce all constraints (see the discussion in Section 4). Since our formulation is a non-convex problem, we use the best result (based on the achieved cut value) of 10 runs with random initializations. Apart from our method, no other method can guarantee to satisfy all constraints, even though SCLC does so in all cases. Our method always produces much better cuts than the ones found by SCLC, which shows that our method is better suited for solving the constrained normalized cut problem. In terms of the clustering error, our method is consistently better than the other methods. In the case of unbalanced datasets (Spam, USPS 0 vs rest) our method significantly outperforms SCLC in terms of cuts and clustering error. Moreover, because of the hard encoding of constraints, SCLC cannot solve multi-partitioning problems.

Dataset          Size    Features   Classes
Breast Cancer    263     9          2
Heart            270     13         2
Diabetis         768     8          2
USPS             9298    256        10
MNIST            70000   784        10

Table 4: Additional UCI datasets
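The evaluation protocol described above (majority-vote clustering error and constraints derived from the labels of randomly sampled pairs) can be sketched in a few lines of Python. The function names, the fixed seed and the toy labels are illustrative assumptions, not part of the original experimental code.

```python
import random
from collections import Counter

def clustering_error(labels, clusters):
    """Fraction of points whose true label differs from the majority label
    of the cluster they were assigned to."""
    members = {}
    for y, c in zip(labels, clusters):
        members.setdefault(c, []).append(y)
    # in each cluster, the points carrying the majority label count as correct
    correct = sum(Counter(ys).most_common(1)[0][1] for ys in members.values())
    return 1.0 - correct / len(labels)

def sample_constraints(labels, num_pairs, seed=0):
    """Sample random pairs; equal labels give a must-link, different labels
    a cannot-link."""
    rng = random.Random(seed)
    must_links, cannot_links = [], []
    for _ in range(num_pairs):
        i, j = rng.sample(range(len(labels)), 2)   # two distinct points
        if labels[i] == labels[j]:
            must_links.append((i, j))
        else:
            cannot_links.append((i, j))
    return must_links, cannot_links

labels = [0, 0, 1, 1, 1]
clusters = [0, 0, 0, 1, 1]
err = clustering_error(labels, clusters)
ml, cl = sample_constraints(labels, num_pairs=6)
```

In the toy example one point of class 1 sits in a cluster dominated by class 0, so one of five points is misclassified by the majority vote.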


6.1 Additional experimental results 


Additional experimental results are given in Tables 5 and 6 for the datasets given in Table 4.


³ CSP could not scale to these large datasets, as the method solves the full (generalized) eigenvalue problem where the matrices involved are not sparse.

Table 2: Results for binary partitioning. Left: clustering error versus number of constraints, Middle: normalized cut versus number of constraints, Right: fraction of violated constraints versus number of constraints.

Table 3: Results for multi-partitioning. Left: clustering error versus number of constraints, Middle: normalized cut versus number of constraints, Right: fraction of violated constraints versus number of constraints.

Table 5: Results for binary partitioning. Left: clustering error versus number of constraints, Middle: normalized cut versus number of constraints, Right: fraction of violated constraints versus number of constraints.

Table 6: Results for multi-partitioning. Left: clustering error versus number of constraints, Middle: normalized cut versus number of constraints, Right: fraction of violated constraints versus number of constraints.